Exact pattern matching: Adapting the Boyer-Moore algorithm for DNA searches
Abstract
Exact pattern matching aims to locate all occurrences of a pattern in a text. Many algorithms have been proposed, but two, the Knuth-Morris-Pratt (KMP) and the Boyer-Moore (BM) algorithms, are the most widespread. Exact pattern matching is the basis of some approximate string matching algorithms such as BLAST, and in many cases it is desirable to locate exact rather than approximate matches. Although several studies included measurements with small alphabets, none of them designed an algorithm specifically to target nucleotide sequences. In addition, no application programming interfaces are available for pattern matching in nucleotide sequences. This study aimed to resolve both issues. A portion of the Chlamydomonas reinhardtii genome (30 megabases) was searched with queries ranging from 10 to 2000 nucleotides and a varying number of matches between one and 25000. The results indicate that two of the algorithms developed in this study are sufficient to cover efficiently the complete search space of the experiment conducted here. Thus the aim of implementing an algorithm that specifically targets pattern matching in nucleotide sequences and making it available to the general public as an application programming interface was achieved. All algorithms are freely available at: http://bioinformatics.iyte.edu.tr/supplements/peerj/.
PeerJ PrePrints | https://doi.org/10.7287/peerj.preprints.1758v1 | CC-BY 4.0 Open Access | rec: 19 Feb 2016, publ: 19 Feb 2016
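To make the Boyer-Moore idea concrete, the following is a minimal sketch of the Horspool simplification of Boyer-Moore (bad-character shifts only) applied to a nucleotide alphabet. It is not one of the algorithms developed in the preprint, whose details are not given in this abstract; the function names `build_shift_table` and `horspool_search` and the default alphabet `"ACGT"` are assumptions made for illustration.

```python
def build_shift_table(pattern, alphabet="ACGT"):
    """Bad-character shift table for the Horspool variant of Boyer-Moore.

    NOTE: illustrative helper, not taken from the preprint.
    """
    m = len(pattern)
    # A symbol that does not occur in pattern[:-1] allows a full shift of m.
    shift = {base: m for base in alphabet}
    # For every other symbol, shift so that its rightmost occurrence
    # (excluding the last pattern position) aligns with the text symbol.
    for i, base in enumerate(pattern[:-1]):
        shift[base] = m - 1 - i
    return shift


def horspool_search(text, pattern, alphabet="ACGT"):
    """Return the start index of every exact occurrence of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    shift = build_shift_table(pattern, alphabet)
    hits = []
    pos = 0
    while pos <= n - m:
        # Compare right to left, as in Boyer-Moore.
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            hits.append(pos)
        # Shift by the table entry of the text symbol under the last
        # pattern position; symbols outside the table (e.g. N) shift by m.
        pos += shift.get(text[pos + m - 1], m)
    return hits
```

For example, `horspool_search("ACGTACGTAC", "ACGT")` returns `[0, 4]`. With only four symbols, nearly every base occurs somewhere in a pattern of moderate length, so the bad-character shifts tend to be short. That is one commonly cited reason for adapting Boyer-Moore style algorithms specifically to DNA.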
Similar resources
Project 2: Pattern Matching in Compressed DNA Sequence
Space efficient storage of large genome sequences requires good compression techniques. However, if these sequences need to be decompressed, before any processing can be done over them, the advantage of compression is lost. New techniques are required to extend the traditional pattern matching algorithms to work directly on the compressed sequence. This saves space in memory, requires less disk...
Implementation of exact-pattern matching algorithms using OpenCL and comparison with basic version
In big text-processing tasks, the exact pattern-matching problem still remains time consuming. As algorithms asymptotically faster than existing ones cannot be developed, there is a need to use another approach to promote efficiency. Thus, parallel computing is able to significantly speed up solving the exact pattern-matching problem. That is why the current work is focused on par...
Fast pattern-matching on indeterminate strings
In a string x on an alphabet Σ, a position i is said to be indeterminate iff x[i] may be any one of a specified subset {λ1, λ2, . . . , λj} of Σ, 2 ≤ j ≤ |Σ|. A string x containing indeterminate positions is therefore also said to be indeterminate. Indeterminate strings can arise in DNA and amino acid sequences as well as in cryptological applications and the analysis of musical texts. In this ...
Fast search in DNA sequence databases using punctuation and indexing
Exact pattern searching in DNA sequence databases has applications in identification of highly conserved regulatory sequences, the design of hybridization probes, and improving performance of approximate homology searching tools such as BLAST and BLAT. We propose a new pattern searching algorithm, CompressedPunctuated-Boyer-Moore (cp-BM), to enhance exact pattern match searches of DNA sequences...
String Matching in the DNA Alphabet
Searching for occurrences of string patterns is a common problem in many applications. Various good solutions have been presented for string matching. The most efficient solutions in practice are based on the Boyer–Moore algorithm.1 A typical question in molecular biology is whether a given sequence has appeared elsewhere. In the following, we will concentrate on searching for exact occurrences...
Journal title:
- PeerJ PrePrints
Volume 4, issue
Pages -
Publication date 2016